Entering (inputting) and processing Chinese language text on a computer is a very difficult problem. The sheer number of Chinese characters illustrates this difficulty. In the square-character (Hanzi) writing system of Chinese, there are 3000 to 6000 commonly used characters. Including the relatively rare ones, there are more than ten thousand Hanzi. Adding to this difficulty, the Chinese language presents problems of text standardization, multiple homonyms, and ill-defined word boundaries that impede effective text processing of Hanzi with computers. In spite of intensive study over several decades and the existence of hundreds of different methods, computer input and processing of Hanzi remains a major stumbling block preventing the use of computers in China, particularly for text processing.
The computer systems available today for inputting and processing Chinese language text may be divided into three categories:
The first category is based on a decomposition of the square characters into elementary graphical components. Different keys on the keyboard are assigned to represent different elementary graphical components of a Hanzi. Each character can then be keyed in with a few keystrokes, as a combination of these elementary graphical components. Examples of this approach include Changji in Taiwan and the Five-Stroke method in mainland China. The major drawback of such methods is that the assignment of the keys to Hanzi components is artificial. In both the Changji and Five-Stroke methods, the assignment of the codes has to be memorized mechanically, which is difficult and time-consuming. Besides, the decomposition of a Hanzi into its elementary components is in many cases not unique. Although these methods are used by professional input operators, and well-trained typists achieve high speed with them, they are not used much by computer experts and other professionals, let alone ordinary people. Therefore, these methods tend to restrict the use of computers by the general Chinese-speaking population.
The second and third categories encounter a "homonym problem" in Chinese language processing.
The second category is phonetic input (e.g., Pinyin in mainland China and "phonetic symbols," or BPMF, in Taiwan), which is the most commonly used method for everyone except professional typists. The Hanzi writing system of the Chinese language is a conceptual and practical barrier to this method.
Since there are only about 1300 different phonetic syllables, in contrast to tens of thousands of characters, one phonetic syllable may correspond to many different Hanzi. For example, the pronunciation of "yi" in Mandarin can correspond to over 100 Hanzi. This creates ambiguities when translating the phonetic syllables into Hanzi.
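The ambiguity described above can be sketched with a small lookup table. This is an illustration only: the table `HOMOPHONES` and the helper names are hypothetical, the syllables are written without tone marks, and the handful of Hanzi listed per syllable is a hand-picked sample rather than a complete inventory.

```python
# Illustrative, hand-picked sample (assumption); a real table would hold
# far more Hanzi per syllable.
HOMOPHONES = {
    "yi": ["一", "以", "已", "意", "义", "易", "医", "衣"],
    "ma": ["妈", "麻", "马", "骂", "吗"],
}


def candidates(syllable):
    """Return every Hanzi listed for a toneless Pinyin syllable."""
    return HOMOPHONES.get(syllable, [])


def count_readings(syllables):
    """Count the distinct Hanzi strings a syllable sequence could stand for."""
    total = 1
    for syllable in syllables:
        total *= len(candidates(syllable))
    return total


if __name__ == "__main__":
    print(candidates("yi"))
    print(count_readings(["yi", "ma"]))
```

Even with this tiny sample, the number of possible Hanzi strings is the product of the per-syllable candidate counts, so it grows quickly with the length of the input.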
To address this "homonym problem," most phonetic input systems use a multiple-choice method. See, for example, German patent 3,142,138, issued May 5, 1983, by J. Heinzl et al.; U.S. Pat. No. 5,047,932, issued Sep. 10, 1991, by K. C. Hsieh; and Chinese patent 1,064,957, issued Mar. 8, 1991, by Tan Shanguang. After a phonetic syllable is keyed in, the computer displays all possible Hanzi with the same pronunciation. In some cases, there is not enough space on the screen to display all of these characters, and the user must scroll up and down. Therefore, these phonetic methods, based on individual syllables, are very slow.
An improvement to the multiple-choice methods, based on deriving the probability of adjacent Hanzi, is disclosed in the prior art. See, for example, British patent 2,248,328, issued on Apr. 1, 1992, to R. W. Sproat. The probability approach can further be combined with grammatical constraints. See, for example, K. T. Lua et al., Computer Processing of Chinese and Oriental Languages, Vol. 6, No. 1, page 85, June 1992. However, the conversion accuracy (phonetic to Hanzi) of these methods is typically limited to around 80%.
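The general idea of choosing among homonyms by the probability of adjacent Hanzi can be sketched as follows. This is not the method of the cited patents; it is a generic best-path search over adjacent-character (bigram) scores. The tables `CANDIDATES` and `BIGRAM`, the constant `UNSEEN`, and all probability values are made up for illustration.

```python
import math

# Hypothetical candidate lists per toneless syllable (assumption).
CANDIDATES = {
    "fei": ["飞", "非", "费"],
    "ji": ["机", "鸡", "几"],
    "chang": ["场", "长", "常"],
}

# Made-up values standing in for P(next Hanzi | previous Hanzi).
BIGRAM = {
    ("飞", "机"): 0.6,
    ("机", "场"): 0.4,
    ("鸡", "场"): 0.05,
}
UNSEEN = 1e-4  # assumed score for any pair missing from BIGRAM


def convert(syllables):
    """Return the Hanzi string with the highest product of bigram scores.

    Assumes a non-empty list of syllables that all appear in CANDIDATES.
    """
    # best[h] = (log score, Hanzi string) of the best path ending in h
    best = {h: (0.0, h) for h in CANDIDATES[syllables[0]]}
    for syllable in syllables[1:]:
        step = {}
        for h in CANDIDATES[syllable]:
            score, path = max(
                (prev_score + math.log(BIGRAM.get((p, h), UNSEEN)), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            step[h] = (score, path + h)
        best = step
    return max(best.values())[1]


if __name__ == "__main__":
    print(convert(["fei", "ji", "chang"]))
```

With these made-up scores, the pairs 飞机 and 机场 dominate, so the three syllables resolve to 飞机场. Wherever the table has no entry, the sketch falls back to the `UNSEEN` score, which hints at why accuracy depends heavily on how well the statistics cover the actual text.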
The third category combines a phonetic-character input method with the addition of non-phonetic letters. Non-phonetic letters are added to the phonetic letters to artificially discriminate characters with the same pronunciation. Examples include phonetic spelling with radical marks (British patent 2,158,776, issued Nov. 20, 1985, by C. C. Chen) and phonetic spelling with the number of strokes (Chinese patent 1,066,518, issued Nov. 25, 1992, by G. Xie). These methods require memorizing artificial rules or counting the number of strokes, which slows input substantially.
In addition to the "homonym problem," a "word boundary problem" exists when processing the Chinese language.
Although more than 80% of words in modern Chinese have multiple syllables (thus two or more Hanzi), there is no word separation in its writing system (in contrast with all European languages, and even Korean). Further, input of phonetic Chinese is usually performed syllable by syllable without accounting for word boundaries.
Although multisyllable words are widely recognized, there is no standard way to delimit words at a word boundary, and the definition and even the existence of words in Chinese are controversial. Furthermore, because Chinese is traditionally written as a continuous string of Hanzi without word spacing, an ordinary Chinese person does not have a clear concept of what a "word" means. In many cases, it is unclear where a word boundary or delimiter, e.g., a space, should be placed. The controversy is exemplified by the following cases:
1. Compound nouns. In English, two independently valid words can be combined to form a compound noun, for example, blackboard or rattlesnake. As in English, controversy exists about whether such compound strings should be treated as one word or two. Because there is no generally accepted precedent in China, the controversy about compound nouns is much more severe. For example, the word "nanguangboyuan" (male announcer), as listed in Chinese Pinyin Vocabulary, may be considered two words (nan guangboyuan), or even three words (nan guangbo yuan), by different people.
2. Affixes. All Chinese verbs can be appended with the "syntax units" -le, -guo, or -zhe, which mark past, present perfect, or progressive tense. All adjectives can be appended with -de. However, these syntax units also appear as individual words called particles. Different schools of linguists treat these syntax units differently. Some schools treat them as "proper" affixes, i.e., part of the word to which they are attached. Other schools treat them as individual particles, i.e., separate words. An affix is part of a word, while a particle is an individual word. For example, while the noun endings -hua, -jia, -yuan, -xing, and -zhuyi are considered by most linguists to be affixes in single words, some linguists consider them individual particles (separate words). On the other hand, endings such as -z, -r, and -tou are always treated as suffixes for nouns and not as individual particles.
3. Compound verbs. There is a class of verbs in Chinese which is very similar to the divisible verbs in German (die zerbrechbar Zeitwort), such as aufziehen, heraufziehen, etc. Those "divisible" verbs can use the infixes -zu- and -ge- to become infinitives or past participles. In Chinese, similar compound verbs can take the infixes -de- or -bu- to mean "capable" or negative. An example is taiqilai (raise), which has the versions taideqilai (can raise) and taibuqilai (cannot raise), very similar to the above German verbs. Moreover, the phrases "taiqi tou lai", "taideqi tou lai", and "taibuqi tou lai" are similar sentence structures using compound verbs (such as "ziehen dein Kopf auf"). From this point of view, "taiqilai" should be one word. However, many linguists consider those syllables separate words (tai, qi, lai), and write them separately.
As illustrated above, in the Chinese language it is often unclear where word boundaries should be placed.
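The boundary ambiguity can be made concrete with the "nanguangboyuan" example. The sketch below enumerates every way to split an unspaced Pinyin string into entries of a word list. The function name `segmentations` and the small lexicon are hypothetical; the lexicon simply pools the entries that the different conventions mentioned above would each accept.

```python
def segmentations(text, lexicon):
    """Return all ways to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this head and split the remainder in every possible way.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical lexicon pooling entries accepted under different conventions.
LEXICON = {"nan", "guangbo", "yuan", "guangboyuan", "nanguangboyuan"}

if __name__ == "__main__":
    for words in segmentations("nanguangboyuan", LEXICON):
        print(" ".join(words))
```

With this lexicon the function yields the one-word, two-word, and three-word readings, and nothing in the string itself indicates which one is intended.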
In spite of the controversy, many multiple-syllable words are universally recognized as minimal linguistic units, or morphemes, such as: (1) nouns "gada", "putao", "feiji", etc.; (2) verbs "zhuanyou", "xingwu", etc.; and (3) adjectives "heised", "pangdad", etc. Also, many phrases are universally accepted as consisting of multiple words. For example, although "dianzigongye" can sometimes be considered one word, no one would consider the phrase "fazhan dianzi gongye" a single word. There are popular four-syllable idioms that are universally considered words, although in different writing styles of Pinyin, hyphens may or may not be used. For these classes of words, unique word boundaries are universally recognized.
As described above, the lack of universally accepted orthographic rules and the lack of a word-separation habit for Chinese make it very difficult to develop an easily used standard for computer input and processing of Chinese language text, because no particular linguistic school is universally followed. Even with a narrow definition of words (i.e., treating many compound words as phrases, and treating many affixes as particles), some ambiguities will remain. With a broad definition of words (i.e., treating many compound words as single units, and accepting many affixes as part of words), the accuracy of identification will improve, but the vocabulary that must be stored in computer memory would be too large to account for every single-unit word and every word with all of its affix combinations.
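The vocabulary growth under a broad definition of words can be illustrated with the affixes named earlier. This is a rough sketch under simple assumptions: every verb takes each of the suffixes -le, -guo, and -zhe, and every compound verb takes each of the infixes -de- and -bu-. The function names and the way a compound verb is split into a head and a tail are hypothetical.

```python
VERB_SUFFIXES = ["le", "guo", "zhe"]  # suffixes named in the text
INFIXES = ["de", "bu"]                # infixes named in the text


def suffixed_forms(verb):
    """Return the bare verb plus one entry per suffix."""
    return [verb] + [verb + suffix for suffix in VERB_SUFFIXES]


def infixed_forms(head, tail):
    """Return the bare compound verb plus one entry per infix."""
    return [head + tail] + [head + infix + tail for infix in INFIXES]


def expanded_size(verbs, compounds):
    """Count the entries needed if every affixed form is stored as a word."""
    total = sum(len(suffixed_forms(verb)) for verb in verbs)
    total += sum(len(infixed_forms(head, tail)) for head, tail in compounds)
    return total


if __name__ == "__main__":
    print(suffixed_forms("fazhan"))
    print(infixed_forms("tai", "qilai"))
    print(expanded_size(["fazhan"], [("tai", "qilai")]))
```

Under these assumptions each plain verb needs four entries and each compound verb three, before any combinations of affixes are considered, so storing every form multiplies the size of the verb vocabulary several times over.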
Textbooks of spoken Chinese for foreigners are written in a spelling form called Pinyin, in which multiple-syllable words are considered basic units. Pinyin uses Roman characters and has its vocabulary listed in the form of multiple-syllable words. A Chinese Pinyin Vocabulary was published in 1964. A revised edition, published in 1989 by Language Press, Beijing, China, contains some 60,000 word entries. Rules of orthography for Chinese written in Pinyin form, which define the word boundaries, were published in 1984.